Reliability and accuracy of ultrasound image analyses completed manually versus an automated tool

Analysis of Brightness-mode ultrasound-captured fascicle angle (FA) and fascicle length (FL) can be completed manually with computer-based programs or by automated programs. Insufficient data exists regarding reliability and accuracy of automated tools. Therefore, the purpose of this study was to determine the test-retest reliability of automatic and manual ultrasound analyses, while determining accuracy of the automatic tool against the manual equivalent. Twenty-three participants (mean ± SD; age = 24 ± 4 years; height = 172.2 ± 10.5 cm; body mass = 73.1 ± 16.1 kg) completed one laboratory visit consisting of two trials where vastus lateralis muscle architecture was assessed with ultrasound. Images were taken at both lower (10 MHz) and higher frequency (12 MHz). Images were analyzed manually in an open-source imaging program and automatically using a separate open-source macro function. Test-retest reliability statistics were calculated for automatic and manual analyses. Accuracy was determined with validity statistics and were calculated for automatic analyses. The results show that manual ultrasound analyses for FA and FL for both lower and higher frequency displayed good reliability (ICC2,1 = 0.75–0.86). However, automatic ultrasound analyses for FA and FL revealed moderate reliability (ICC2,1 = 0.61–0.72) for the lower frequency images and poor reliability (ICC2,1 = 0.16–0.27) for higher frequency images. When assessed against manual techniques, automatic analyses presented greater total error (TE) and standard error of the estimate (SEE) for FA at lower frequency (constant error (CE) = −3.91°, TE = 5.57°, SEE = 3.45°) than higher (CE = −2.78°, TE = −4.54°, SEE = 2.45°). For FL, the higher frequency error (CE = 0.92 cm, TE = 2.12 cm, SEE = 1.15 cm) was similar to lower frequency error (CE = 1.98 cm, TE = 3.66 cm, SEE = 1.57 cm). The findings overall show that manual analyses had good reliability and low absolute error, while demonstrating the automated counterpart had poor to moderate reliability and large errors in analyses. These findings may be impactful as they highlight the good reliability and low error associated with manually analyzed ultrasound images and validate a novel automatic tool for analyzing ultrasound images. Future work should focus on improving reliability and decreasing error in automated image analysis tools. Automated tools are promising for the field as they eliminate biases between analysts and may be more time efficient than manual techniques.


INTRODUCTION
Brightness mode (B-mode) ultrasound is a popular imaging modality for assessing muscle size (i.e., cross-sectional area, thickness) and architecture (i.e., fascicle angle (FA), fascicle length (FL). Previous studies have shown that muscle structure which influence muscle size (i.e., FL and FA) are important to whole muscle function and force production (Lieber & Friden, 2000;Narici, 1999). Specifically, muscle architecture can provide insight, via proxy, regarding the total amount of sarcomeres in parallel (i.e., fascicle angle) and in series (i.e., fascicle length) (Maden-Wilkinson et al., 2020;Seynnes, de Boer & Narici, 2007). Previous works have shown that in pennate muscles (e.g., vastus lateralis, gastrocnemius), accounting for fascicles which are arranged at an angle relative to the tendon may be valuable as changes in fascicle angle can increase or decrease the mechanism of force generation within the muscle (Kawakami, Ichinose & Fukunaga, 1998;Kwah et al., 2013;Narici, 1999;Seynnes, de Boer & Narici, 2007). Therefore, assessing muscle size and architecture is important when assessing muscle function and force production and can easily be captured using B-mode ultrasound.
Analysis of B-mode ultrasound-captured FA and FL was classically performed manually using a protractor and printed image (Kawakami, Ichinose & Fukunaga, 1998), though contemporary computer-based image analysis options exist and are more frequently used (Carr et al., 2021). For instance, computer programs such as ImageJ (National Institutes of Health) or Photoshop (Adobe) include angle and straight line measurement tools which may be used to calculate FA and FL. Recently, automated analysis tools using various programs (i.e., MATLAB, FIJI) used to estimate FA and FL have grown in interest (Cronin et al., 2021;Seynnes & Cronin, 2020). Automated tools can be more efficient as less time has to be devoted to training technicians and some automated scripts can bulk-process images instead of the individualized approached required by manual techniques. While the automatic image analysis programs may bring many benefits to future studies, insufficient data exists regarding reliability and accuracy of automatic ultrasound analysis tools compared to manual techniques. Also, there is not a clear consensus in regard to the ultrasound frequency that should be utilized as previous studies have used 10 MHz (Carr et al., 2021;Ryan et al., 2016), while 12 MHz is also common (Mota & Stock, 2017) and may uniquely impact image quality. Therefore, the primary purpose of this study was to determine the test-retest reliability of a novel, automatic ultrasound analysis tool for the assessment of vastus lateralis muscle architecture (i.e., FA, FL). A secondary purpose of this study was to assess the accuracy of the novel, automatic ultrasound analysis tool against the manual analysis equivalent. Lastly, a tertiary purpose of this study was to examine the difference in reliability between images taken at a lower frequency (10 MHz) and higher frequency (12 MHz). It was a priori hypothesized that the novel, automatic ultrasound analysis tool for the assessment of muscle architecture would have good reliability (ICC 2,1 > 0.75) and less error compared to the manual counterpart.

METHODS Participants
Twenty-three participants (mean ± SD; age = 24 ± 4 years; height = 172.2 ± 10.5 cm; body mass = 73.1 ± 16.1 kg) were recruited from the local community for this study. Inclusion criteria consisted of individuals 18 to 35 years old and normal body mass index (BMI ≥ 30 kg/m 2 ). All participants read and signed an informed consent document explaining the risks and benefits of participating in the study. Participants were also asked to complete a health history questionnaire prior to participating in the study. This study was granted ethical approval by the University of Alabama Institutional Review Board (#21-02-4385) prior to data collection.

Experimental design
Participants completed one laboratory visit (~1 h) which consisted of two independent data collection trials (trial 1, trial 2) separated by a 10-min rest period. During each trial, Brightness mode (B-mode) ultrasound was used to take images of the vastus lateralis (VL), where muscle architecture (i.e., fascicle angle (FA) and fascicle length (FL)) of the VL was later assessed with open-source image analysis programs. During the 10-min rest period, participants remained relaxed and were not allowed to stand up or walk around the laboratory. The independent data collection trials consisted of identical methodologies and were subsequently used for the calculation of reliability statistics (see below).

Ultrasonography
Participants laid supine on a portable exam table to undergo non-invasive imaging of the right VL using a B-Mode ultrasound imaging device (LOGIQ e R8; General Electric Company, Milwaukee, WI, USA) in conjunction with a multi-frequency linear-array probe (L4 -12t -RS, 4.2-13 MHz, 47.1 mm field of view; General Electric Company, Milwaukee, WI, USA). The VL was marked at the proximal and distal musculo-tendon junctions as determined via ultrasound and length was measured with a flexible tape measure. To capture muscle architecture, a longitudinal image was taken by scanning the entire length of the VL (i.e., proximal to distal musculo-tendon junction) while using the extended field of view function. Ultrasound gain and depth settings were held constant (i.e., gain = 52 dB and depth = 6 cm) for each participant. Ultrasound frequency was systematically changed from 10 to 12 MHz (i.e., lower frequency and higher frequency, respectively). Images were systematically taken in the following order (i) frequency 10 MHz and depth 6 cm and (ii) frequency 12 MHz and depth 6 cm. Caution was taken to ensure consistent and minimal pressure was applied to the probe as it moved across the skin to not compress the underlying muscle tissue. A generous amount of acoustic coupling gel was applied to the skin to enhance image acquisition. During data collection, a minimum of two quality images were taken per ultrasound frequency. A quality ultrasound image is defined as an image where the superficial and deep aponeuroses of the muscle of interest are clearly seen as well as fascicle that is clearly captured from superficial to deep aponeuroses. During offline-image analysis (see-below) the single clearest image was used for processing.

Manual ultrasonography analysis
All images were manually analyzed in an open-source imaging computer program (ImageJ; National Institutes of Health, Bethesda, MD, USA), after scaling from pixels to centimeters. The angle tool in ImageJ was used to measure FA by tracing a clearly visible fascicle the distance between the superficial and the deep aponeuroses. The angle of interest was defined as the angle formed by the fascicle meeting the deep aponeurosis. To measure FL, ImageJ's straight-line tool was carefully placed on top of the previously described FA line to measure the length of the fascicle of interest (Fig. 1). During each measurement, the investigator carefully zoomed in to ensure the appropriate fascicle components were assessed. Both FA and FL measurements were recorded.

Automatic ultrasonography analysis
The automatic ultrasound analyses were completed using the Simple Muscle Architecture (SMA) 17 macro function in FIJI (ImageJ; National Institutes of Health, Bethesda, MD, USA). The settings for SMA were set to the guidelines recommended by the original authors (Seynnes & Cronin, 2020). The author's programming code for the SMA function is written in ImageJ macro language and is run in FIJI. FIJI is an enhanced version of the popular open-source imaging platform, "ImageJ" (Schneider, Rasband & Eliceiri, 2012). The SMA function works by four distinct processes: (i) detect field of view, (ii) detect the superficial and deep aponeuroses, (iii) measure fascicle orientation, (iv) calculate parameters. This automated process then presents the operator with FA and FL, which were recorded. For full details on this algorithm, the reader is guided to the original work of Seynnes & Cronin (2020). The automatic analyses were completed on the same images that were analyzed manually; however, if the SMA function could not locate the FA and FL using the image selected by the authors, the next best image was used until the SMA function correctly identified a FA and FL. This only occurred for one participant during trial 2, for which the SMA function could not locate a FA and FL for any image with a lower frequency. The SMA function was able to successfully analyze images for the other 22 participants for both trial 1 and 2.

Statistical analyses
Test-retest reliability statistics (i.e., intraclass correlation coefficient model 2, 1 (ICC 2,1 ), standard error of measure expressed as a percentage of the mean (SEM%), and the minimal difference (MD) needed to be considered real) were calculated between trial 1 and 2 for both the automatic and manual analyses of FA and FL, respectively. Calculations were performed according to test-retest reliability procedures described by Weir (2005). Reliability ratings were determined by the ranges described by Koo & Li (2016), that describe ICC estimates less than 0.5 as poor, 0.5 to 0.75 as moderate, 0.75 to 0.9 as good and 0.9 and greater as excellent reliability. To assess accuracy of the automated tool, validity statistics (i.e., constant error (CE), total error (TE) and standard error of the estimate (SEE)) and Bland-Altman plots were calculated for the automatic analyses using the manual analyses as criterion values for both the FA and FL images taken with a lower and higher frequency during trial 1. The error calculations were completed as follows: constant error (CE = criterion (manual analysis)comparison (automated)), total error (TE = √σ (comparison-criterion) 2 /n), and standard error of the estimate (SEE = √σ (comparison-criterion) Â √1 − r 2 ). Statistical analyses were completed using R version 4.1.0 and RStudio (Version 1.4.1717) using the "blandr", "irr", and "psych" packages. The a priori criterion for significance was set at a = 0.05.

Reliability of manual ultrasonography analyses
Manual ultrasound analyses for FA with a lower frequency had good reliability (ICC 2,1 = 0.75, Cohen's d = 0.05) and the SEM% and MD were 13.69% and 7.40 , respectively. For FL with the same image settings, the manual analyses also had good reliability (ICC 2,1 = 0.83, Cohen's d = 0.14) with the SEM% of 12.74% and the MD of 2.48 cm. The manual ultrasound analyses with a higher frequency had good reliability for FA (ICC 2,1 = 0.75, Cohen's d = 0.03) with a SEM% of 9.99% and MD of 5.38 . For FL with the same image settings, the manual analyses had good reliability (ICC 2,1 = 0.86, Cohen's d = 0.02) with the SEM% of 10.72% and MD of 2.16 cm. Further, the manual analyses for both FA and FL had good reliability with low absolute error. The reliability statistics for the manual ultrasound analyses are presented in Table 1.

Reliability of automatic ultrasonography analyses
Automatic ultrasound analyses for FA with a lower frequency had moderate reliability (ICC 2,1 = 0.61, Cohen's d = 0.01) and the SEM% and MD were 18.57% and 8.79 , respectively. For FL with the same image settings, the manual analyses also had moderate  Table 1.  Table 2. Accuracy at the individual level is presented in Bland-Altman plots (Fig. 2) for FA and FL with images taken at both lower and higher frequencies. The regression lines presented in the Bland-Altman plots were statistically significant, indicating proportional biases for higher frequency FA (R 2 = 0.22; P = 0.03), higher frequency FL (R 2 = 0.34; P = <0.01), and lower frequency FL (R 2 = 0.36; P < 0.01), but not lower frequency FA (R 2 < 0.01; P = 0.93). Table 1 Test-retest reliability of manual ultrasound analyses for the vastus lateralis. Participants completed one laboratory visit which consisted of two independent data collection trials. Images were systematically taken with various frequencies. Images were manually analyzed using ImageJ and the automatic ultrasound analyses were completed using a macro function in FIJI. (ICC 2,1 , Intraclass correlation coefficient model 2,1; SEM%, standard error of measure expressed as a percentage of the mean; MD, minimal difference values needed to be considered real).

DISCUSSION
This study examined the test-retest reliability of a novel, previously published, automatic ultrasound analysis tool for the assessment of VL muscle architecture (i.e., FA, FL). In addition to assessing reliability, this study examined the accuracy of a novel, automatic ultrasound analysis tool by using manual ultrasound analyses as the criterion. The findings from this study show that manual ultrasound analyses for FA and FL for both lower and higher frequency displayed good reliability (ICC 2,1 = 0.75-0.86) and low absolute error (SEM% = 9.99-13.69%
In the current study, automatic ultrasound analysis reliability for muscle architecture (i.e., FA, FL) were poor to moderate (ICC 2,1 = 0.16-0.72) with moderate absolute error (SEM% = 18.57-31.38). Cronin et al. (2021) reported reliability statistics of a semi-automated tracing of ultrasound images of the biceps femoris muscle architecture and reported a low coefficient of variation (CV) for the semi-automatic tracing of FA (CV% = 2.58-10.70) and FL (CV% = 0.64-1.12). It is likely that the differences between the previous studies and the present work are due to a difference in muscle architecture (i.e., FA, FL) assessment procedures. The present work utilized a single muscle fascicle for each technique (i.e., auto, manual), whereas many other studies (Akagi, Hinks & Power, 2020;Gerstner et al., 2017;Narici et al., 2021;Sarto et al., 2021) use the average of multiple fascicles. It is likely that taking an average of multiple fascicles is superior to assessing just a single fascicle. However, the use of the SMA tool in the present study only incorporates a single muscle fascicle into its muscle architecture measure (i.e., FA, FL) which was mimicked in the manual technique used in the present work. Table 2 Validity of automatic vs manual analyses of the vastus lateralis. Participants completed one laboratory visit which consisted of two independent data collection trials. Images were systematically taken with various frequencies. Images were manually analyzed using ImageJ and the automatic ultrasound analyses were completed using a macro function in FIJI. Validity of the automatic analyses were calculated using the manual analyses as criterion values for both FA and FL images taken during trial 1. Previous studies have discussed the use of ultrasound imaging as a valid modality for measuring phantom or cadaver muscle architecture (Bénard et al., 2009;Van Hooren, Teratsias & Hodson-Tole, 2020;Kawakami, Abe & Fukunaga, 1993;Kellis et al., 2009;Kwah et al., 2013).  Kellis et al., 2009) works do suggest ultrasound to be a valid technique to assess muscle architecture. While the automatic tool used in the present study is valid when compared to the manual method, future studies may wish to validate this technique against other imaging modalities (i.e., MRI) and using different ultrasound operators (Carr et al., 2021). As mentioned previously, many studies use various ultrasound frequencies. In medical imaging, ultrasound frequency is inversely proportional to its ability to penetrate tissue (Edelman, 2012). Said another way, when imaging of deep anatomical structures needed, investigators may benefit from using low-frequency ultrasound (e.g., 8-10 MHz). Conversely, higher frequency may produce higher quality ultrasound images when examining superficial tissues (Abu-Zidan, Hefny & Corr, 2011). The findings of the current study are incongruous between the lower frequency and higher frequency analyses completed with the automated tool. For example, lower frequency analyses overall had larger error than the higher frequency analyses. Perhaps, when considered with the use of superficial images in the present study, the differences in accuracy between lower and higher frequency automated image analyses in the current study is the result of higher frequency causing less contrast between tissues and overall poorer image quality; therefore, leading to more error and poorer reliability compared to the lower frequency analyses. In support of this argument, the authors of the SMA program discuss that the algorithm uses contrast and echogenicity to distinguish between tissues (Seynnes & Cronin, 2020). Future studies may wish to employ a larger range of ultrasound frequencies across different depths of tissue to better understand the effect of hardware settings on image analysis error.

Limitations
In the present study, it could not be ensured that the fascicle selected for the manual and automatic technique was identical nor that either technique assessed potentially curved fascicles without bias. Also, two separate data collection trials were needed to calculate the test-retest reliability statistics for the current study; therefore, it cannot be guaranteed that the same fascicle and its fascicle angle were captured and analyzed in both trials. The present work utilized a single, unblinded investigator who analyzed all ultrasound images. Future investigators may wish to consider the effect of multiple raters on the reliability and validity statistic presented. While averaging fascicles lengths and fascicle angles may reduce the variability of the measure, the automated program used in the current study only selected a single fascicle length and angle to measure. Therefore, the authors manually measured a single fascicle and corresponding angle. Future investigators and developers of automated tools may wish to average multiple fascicles for their analyses in order to reduce variability. Lastly, the SMA macro procedure produced some muscle architecture measurements which may not be physiological (e.g., FA > 60 ). Future investigations of automated image processing techniques may benefit with the inclusion of a "filter" so that outlying fascicle angles would be removed. However, these values are left in the final dataset of the present investigation to maximize the current applicability of the SMA algorithm for the reader.

CONCLUSION
In summary, the primary outcomes of this study indicated that manual ultrasound analyses were more reliable in comparison to the novel, automatic ultrasound analysis tool used in the present study in the calculation of FA and FL. It is recommended that the analyst is cautious when using the automated technique to analyze images taken at a higher frequency as the results indicated poor reliability for FA and FL. In addition to revealing good reliability, the manual ultrasound analyses presented less absolute error than the automatic analyses. When accuracy was assessed against manual analyses, the automatic tool presented moderate CE, TE, and SEE. These findings may be impactful as they highlight the good reliability and low absolute error associated with manually analyzed ultrasound images as well as the validation of a novel automatic tool for analyzing ultrasound images. Though promising, the reader is encouraged to employ caution in the use of automatic analysis tools for image processing until further work can be completed.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
The authors received no funding for this work.